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Abstract—With the rapid development of mobile technolo- 
gies, social networking softwares such as Twitter, Weibo 
and WeChat are becoming ubiquitous in our every day 
life. These social networks generate a deluge of data that 
consists of not only plain texts but also images, videos, and 
audios. As a consequence, the traditional approaches that 
classify the short text by counting only the key words become 
inadequate. In this paper, we propose a multimedia short 
text classification approach by deep RNN(Recurrent neural 
network ) and CNN(Convolutional neural network) cascade. 
We first employ an LSTM(Long short-term memory) net- 
work to convert the information in the images into text 
information. Then a convolutional neural network is used 
to classify the multimedia texts by taking into account both 
the texts generated from the image as well as those contained 
in the initial message. It is seen through experiments using 
MSCOCO dataset that the proposed method exhibits signifi- 
cant performance improvement over the traditional methods. 


I. INTRODUCTION 


In the era of big data, the traditional form of information 
has been changed. Most obviously, the long text form has 
already been replaced by the short text form. For example, 
we can use Twitter to publish our messages with limited 
140 words. Meanwhile, the plain texts has been gradually 
replaced by messages of images and short plain texts. 
There is an instance from WeChat’s Circles:“ Today is 
a special day for every American!” with a picture that 
draws Trump and Hillary. If we just use picture and text 
separately, we can get no conclusion. But if content was 
extracted from this image and combine the content with 
original text, we will know this message should belong to 
the category of politics. And we may draw a conclusion 
that today is the presidential election day. 

Traditionally, there were two common methods to clas- 
sify multimedia short texts. The first method is to classify 
the raw data by image information without considering the 
text parts [1]. The second method is to classify the raw 
data by text information without considering the image 
parts. There are two types of text classifiers. One is 
text classifier such as mixture modeling text classifier, 
probabilistic and naive Bayes classifier [2], [3] and so 
on. The other is CNN(Convolutional neural network) text 
classifier which is based on deep learning [4], [5]. 

In this paper, we try to combine original texts and image 
contents to classify multimedia short texts. How to extract 
useful information from image? What rules should obey 
when we extract information from images? We propose a 
new RNN-CNN model to answer these two problems and 
obtains a better result. 
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The remainder of our paper is organized as follows: 
we review some related works in section 2. We give 
the architecture of proposed model and clearly describe 
the details of the model in section 3. The section 4 
demonstrates effective result of the proposed model with 
experiments. At last, we make a whole summary of this 


paper. 
II. RELATED WORKS 


With the advent of different social softwares, the mes- 
sages are organized in a variety of ways and organised 
by many new features, such as “@username” feature 
[6] ,slangs, percentage signs and so on.It is difficult for 
traditional models to summarize so many data features. The 
traditional input form— bag of words(BOW) comes across 
data sparsity problem [6], [7], [8]. The text classifiers 
which are based on deep learning are getting more and 
more popular. Word representation vectors as the input 
of neural network exert great influence on classifying 
results. The effective pre-trained word embeddings can be 
obtained according to neural network [9], [10], [11]. Not 
only word embedding vectors keep the relationships of 
words, but also satisfy the context semantic relationships. 

As described before, images may have a strong asso- 
ciation with original texts in most instances when they 
are in the same multimedia text. Recently, some images 
classification methods which depend on deep learning 
have got an improved achievement [12], [13], so that we 
can get its category just depending on images content. 
For example, when there is a picture painted with a 
solider, there is a high probability that it belongs to 
military category. In this paper we take images information 
into account when classifying multimedia short texts by 
methods based on deep learning. If we want to combine 
original texts and image contents together as a new input 
of CNN text classifiers, they must be projected to the same 
vector space so that they can be used as the unified input 
of CNN classifiers. In addition, when we want to extract 
useful sentences from pictures, these sentences should be 
grammatical and consistent with the picture content. There 
are lots of models about image caption work [14], [15]. 
But they all encounter data sparsity problem in the big 
data field. Recently, some special RNN (Recurrent neural 
network) models are applied to the machine translation 
field and make important achievements [16], [17], [18]. 

With the development of computer vision [19], the 
accuracy of image classification has arrived at a high level. 
These image recognition methods can not describe the 
content of pictures effectively when images are blurry. 
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Figure 1. 


In this paper we apply LSTM model [20] to generate a 
sentence for describing an image. Even if the images are 
blurry, the sentences can also give some relative verbs and 
adjectives which are helpful for classifying images. The 
core of LSTM model is to predict every word in sentences 
correctly. The Kiros team [21] used a convolutional neural 
network to predict next word based on given sentences 
and pictures. There are some other works which applied 
recurrent neural network to the prediction tasks [22], [23]. 


III. RNN-CNN CASCADE MODEL 
A. Overall of RNN-CNN 


Our model consists of an image caption model and a 
CNN text classifier. At first, we use a pre-trained CNN 
model to encode an image J as a fix-length vector [24]. 
The “show and tell” model [20] will be adopted to generate 
sentences subsequently. The objective function p(S | I) is 
taken to train the LSTM model, where S represents the 
generated sentence. The correct sentence representation 
of an image will be obtained when p(S | J) arrives at 
maximum value. Then we connect this sentence with an 
original short text as the CNN text classifier input. This 
end-to-end model is called RNN-CNN(Recurrent neural 
network cascade with Convolutional neural network) clas- 
sifiers as in Figure 1. 

In the process of generating descriptions, the model 
have to meet two requirements which have been described 
in Introduction. The RNN model is selected to generate 
descriptions because the RNN model can save contextual 
relationships. 

For a given image in the multimedia short text, the 
best image description can be computed by conditional 
probability as follows. 
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Where S = (s1,--+-,87) represents the sentence generated 
by RNN model. s;,¢ = 1,---,Z represents a word in 
the sentence S. T is the number of the words in the 
sentence and it is an unbound value. J is an input image. In 
the process of RNN , the model generates a single word 
s+ at each iteration. We employ GoogLeNet pre-trained 
model on ILSVRC 2014 dataset for object recognition and 
detection [25], which is the largest image classification 
dataset at present. The iteration will be continued unless 
it generates the character of <EOS>. 


B. LSTM Model 


Each iteration of RNN has three inputs and one output, 
t represents the numbers of iterations. One of the inputs 
is the memory h+, which is a hidden state with fixed 
length. The memory h+ saves information that generated 
from beginning to end and updated at each time step 
with another input x+. The last input is the memory cell 
state c+—1. The output of each iteration is probability 
distribution over all words. 


hig = f (ht, Xt) (2) 


We choose LSTM as f(-). The LSTM is designed to 
avoid long-term dependency problem and it has carefully 
been designed for the structure of gates. The gate is 
a method which controlling information from hı. 
The formulae to calculate the memory cell output and 
update the memory cell state through gates are as follows: 


fi = o(W faxt + Werke) (3) 

i, = o(Wizxı + Wipnht) (4) 

Ce = fi, OCr-1 +i © tanh(Werx, + Wenhi) (5) 
0, = o( Worx + Wonht) (6) 

hi+ı = 04 © tanh(cz) (7) 
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Figure 2. This is an example 


Pt+1 = Softmax(hr41) (8) 


o is a sigmoid function. © is an element-wise mul- 
tiplication operation.The various W matrices are trained 
parameters.The forget gate f, decides how much infor- 
mation should be dropped. The value of f; is between 
0 (drop completely )and 1 (keep completely). The input 
gate i, controls updating process. It decides which part 
of the input x, and the last hidden state h; should be 
updated to the memory cell c;. The output gate o; decides 
which part of a memory cell c, should be outputted. The 
tanh is a hyperbolic tangent function. The p:+1 represents 
the probability distribution over all words. The initialized 
input of LSTM is an image which is disposed by CNN. 
The image I is only input once. 


xı = CNN(D (9) 


The output sequence looks like (s1 ---s y). Our loss is the 
sum of the negative log likelihood of the correct word at 
each step as follows: 


N 
L = -X logp:(s:) (10) 
t=1 


pı represents the probability of generating correct word 
S+. When sum of these probabilities arrives at the largest 
level, the loss function arrives at the minimal value. 
When getting the sentences generated by LSTM model, 
we have to use some metrics to evaluate the generated 
sentences [26]. Now we use the CIDEr [27] as main 
metric and use the BLEU [28] as auxiliary metric. The 
higher score the sentence get, the more effective they are, 
both based on CIDEr and BLEU. 


C. CNN Model 


The following is the whole process of RNN-CNN 
model (Figure 1).The first layer is an input layer. The 
sentence of S = {s1,82,---,sn} is generated by LSTM 


of multimedia short text application 


model. N represents word numbers in the longest sen- 
tence. Y = {y1,yo,---,yr’} is the original caption in 
a multimedia short text. Now we connect them as P = 
{r1 r2, Fn hP = S@Y. The n equals the sum of 
N plus T’. We apply word embedding from scratch to 
represent each word in the sentence. For the sentence 
P = {rj,ro,--+,rn}, it is projected to a matrix M. 
M € R™** is obtained by looking up table. k is the 
dimension of word embedding. Every word in the sentence 
be the k-dimensional word vector. 

In the convolution layer, we use the rj.;,;~1 to rep- 
resent concatenation of words r;,ri4i1,°-*,Fi+j—1-j is the 
window kernel size of convolution. The weight We Rk 
is applied to the convolution operation to produce a new 
feature. The process of generating new feature c; is as 
follows: 


Ci = fw ‘Tii+j—1 +b) (11) 


f is a non-linear function . be R is a bias term. We 
apply the kernel with window size j to slide sentence 
P = {r1,r2,: +- ,Fn} for getting a feature map: 


. ee F4| (12) 


ce R"-J*! is generated with a window kernel size of 
j.The main purpose of convolutional layer is to strengthen 
the local features. The different convolutional kernels with 
different sizes can extract multiple features. 

The following is max pooling layer. The max pooling 
operation select the maximum value ê = mazx{c} on c 
as the feature. The maximum value can capture the most 
obvious feature of each feature map.The max pooling 
layer can reduce parameters number and decrease time 
complexity of whole model. It also can solve the problems 
of variable sentence lengths . 

The next layer is fully connected softmax layer with 
lg-norms dropout operation. Some neural units are ig- 
nored for preventing co-adaptation of hidden units. After 


c= [C1,C2,°° 


being disposed by the max pooling layer, the output is 
z = [C1,--++,@m]. m represents the number of filters. The 
process of dropout is as following: 

h=W -(qoz)+b (13) 
h is a output unit in forward propagation. W” is weight 
matrix. © is the element-wise multiplication operator. 
q€ R” is a ’masking’ vector which all elements’ values 
in it are coincidence of Bernoulli variables. This dropout 
operation mainly for avoiding overfitting. Meanwhile, the 
Softmax operation is applied as a classifier to generate 
the probability distribution over categories. 

Our experiment takes multichannel architecture, which 
showed in Figure 1. One is static channel, which these 
parameters keep static in process of training. The other 
is non-static channel, which these parameters can be 
fine-tuned via backpropagation. The multichannel has 
better performance on many datasets. Because the 
fine-tune operation makes vectors more close to specific 
duty. At the same time, the non-static operation limits the 
changing distance of vectors.Figure 2 shows an example 
of multimedia short text application. As we can see, our 
model can do a better classification than the traditional 
CNN model. 


IV. EXPERIMENT 


MSCOCO [29] is a suitable dataset for multimedia short 
text classifying. Because MSCOCO contains 91 categories 
and each category consists of many multimedia short 
texts. For testing the robustness of our model, we conduct 
three groups of experiments to test the result respectively 
based on two-category, three-category and five-category 
dataset(Figure 3). 

The two-category experiment is based on vehicle cate- 
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Figure 3. This is the dataset graph for training and testing 

gory which includes five subclasses (car, boat, truck, air- 
plane, train) and animal category which includes five sub- 
classes (giraffe, zebra, bear, sheep, elephant). The three- 
category dataset is based on the two-category dataset and 
furniture category dataset which includes five subclasses 
(bed, couch, closestool, table, chair). The five-category 


set Two- Three- Five- 
method Category | Category | Category 
text(CNN) 0.86512 0.80000 0.76594 
image(ResNet34) 0.90625 0.84961 0.80097 
image-text(RNN-CNN) | 0.95116 0.95504 0.94778 
Table I 


THE CLASSIFICATION ACCURACY OF PROPOSED RNN-CNN METHOD 
AGAINST OTHER MODELS 


dataset added sports category dataset which includes five 
subclasses (skis,snowboard,kite,baseball,skateboard) and 
food category dataset which includes five subclasses (sand- 
wich,cake,orange,broccoli,carrot) on the basis of three- 
category dataset. In order to ensure the fairness and accu- 
racy of our experiment, we extract the 1000 multimedia 
short texts of each class for training, each subclass has 
200 multimedia short texts. At the same time, we extract 
the 300 multimedia short texts of each class for testing, 
each subclass has 60 multimedia short texts. 

We use the caption texts and their labels together to train 
and test with text CNN. Then we apply images and their 
labels together to train and test with image ResNet. At last, 
the RNN-CNN model (the CIDEr value is 85) does a new 
classification experiment by means of using multimedia 
short texts. Then we compare these three cases’ results 
carefully (Table 1). 

As we can see, the classification accuracy of image 
ResNet34 model is four percent higher than the average 
of text CNN model. But the classification accuracy of 
image-text RNN-CNN model is more higher than the 
image ResNet34 model. Meanwhile, the accuracy of image 
ResNet34 and text CNN model decreases as classification 
quantity increases, while image-text RNN-CNN model 
maintains almost the same accuracy on the two-category 
and multi-category classification. The RNN-CNN model 
has higher stability and robustness. 


V. CONCLUSION 


This paper takes new data pattern into consideration and 
proposes an end to end novel classifier model that can 
automatically classify the new media data. The traditional 
image classifier can capture image features. But the image 
can not be recognized correctly when it’s blurry. When 
changing image to sentence, we can get helpful semantic 
relationship like some verbs and adjectives which are 
very useful for classifying. When the generated sentence 
connects with the original description together as input, as 
our experiment shows, the accuracy and stability of RNN- 
CNN model all get great improvement. The applicability 
of RNN-CNN model will be extensive so that it can be 
widely applied to data classification of new media. 
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